A Treebank of Spanish and its Application to Parsing
Abstract
This paper presents joint research between a Spanish team and an American one on the development and exploitation of a Spanish treebank. Such treebanks for other languages have proven valuable for the development of high-quality parsers and for a wide variety of language studies. However, when the project started, at the end of 1997, there was no syntactically annotated corpus for Spanish. This paper describes the design of such a treebank and its initial application to parser construction.

1. Constructing a Spanish treebank

1.1. Preliminary considerations

As there was no previous experience in building a syntactically annotated corpus for Spanish, the first effort necessarily consisted of writing a set of annotation guidelines. The starting point was the documentation existing at that time, especially that of the Penn Treebank project (Marcus, Santorini and Marcinkiewicz, 1993; Bies et al., 1995), the EAGLES preliminary recommendations (EAGLES, 1996), and the Negra corpus (Skut et al., 1997). Our experience in developing Spanish NLP systems told us that a pure phrase structure annotation (typical of the English treebanks) would not be enough for inducing relevant rules for Spanish. At the least, information about agreement and syntactic functions is necessary for Spanish, and we wanted to incorporate that information in our trees in the form of features. The treebank has been created mostly by hand, although some automatic pre-tagging of the data is performed, as described below, to speed up treebank creation.
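To make the idea of phrase structure trees enriched with features more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the class `Node`, the helpers `agrees` and `to_bracketed`, the category labels (`NP`, `D`, `N`), the feature names (`gender`, `number`, `function`), and the bracketed output format are all hypothetical and are not taken from the treebank's annotation guidelines.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """A phrase-structure node that also carries a feature dictionary.

    Labels and feature names are hypothetical, chosen for illustration only.
    """
    label: str
    features: Dict[str, str] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None  # set only on leaf nodes


def agrees(a: Node, b: Node, keys=("gender", "number")) -> bool:
    """True if both nodes define every feature in keys with equal values."""
    return all(
        a.features.get(k) is not None and a.features.get(k) == b.features.get(k)
        for k in keys
    )


def to_bracketed(node: Node) -> str:
    """Render the tree in bracketed form, with features in square brackets."""
    feats = ",".join(f"{k}={v}" for k, v in sorted(node.features.items()))
    head = f"{node.label}[{feats}]" if feats else node.label
    if node.word is not None:
        return f"({head} {node.word})"
    return f"({head} " + " ".join(to_bracketed(c) for c in node.children) + ")"


# "la casa" (the house): determiner and noun agree in gender and number,
# and the noun phrase is marked with an assumed syntactic function.
det = Node("D", {"gender": "f", "number": "sg"}, word="la")
noun = Node("N", {"gender": "f", "number": "sg"}, word="casa")
np_ = Node("NP", {"function": "SUBJ", "gender": "f", "number": "sg"}, [det, noun])

print(agrees(det, noun))
print(to_bracketed(np_))
```

A pure phrase structure annotation would keep only the labels and the nesting; the feature dictionary is where agreement and function information of the kind the authors describe could be stored, so that rules induced from the trees can be conditioned on it.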
Similar resources
Extracting LTAG Grammars from a Spanish Treebank
Treebank grammars are known to help in building robust, wide-coverage statistical parsers that also obtain state-of-the-art accuracies. In this work, we present a system that extracts LTAG grammars for Spanish from a constituency-based Spanish treebank. We evaluate the extracted grammar in terms of its size, its coverage on unseen data and the performance of a supertagger trained on it. The s...
Interactive Predictive Parsing Framework for the Spanish Language
The Interactive Predictive Parsing (IPP) framework allows the construction of interactive tree annotation systems. These can help human annotators create error-free parse trees with little effort (compared to manually post-editing the trees obtained from a completely automatic parser). In this paper we adapt the IPP framework and the IPP-Ann annotation tool for parse of the Spanish lang...
Automatic Error Correction in a Syntactic Treebank Using Transformation-Based Machine Learning
A treebank is one of the most useful resources for supervised or semi-supervised learning in many NLP tasks such as speech recognition, spoken language systems, parsing and machine translation. Treebanks can be developed in different ways, which can generally be categorized into manual and statistical approaches. While the treebank resulting from each of these methods contains annotation errors,...
Exploring Morphosyntactic Annotation over a Spanish Corpus for Dependency Parsing
It has been observed that the inclusion of morphosyntactic information in dependency treebanks is crucial to obtain high results in dependency parsing for some languages. In this paper we explore in depth to what extent it is useful to include morphological features, and the impact of diverse morphosyntactic annotations on statistical dependency parsing of Spanish. For this, we give a detailed ...
Statistical Parsing of Spanish and Data Driven Lemmatization
Although parsing performance has greatly improved in recent years, grammar inference from treebanks for morphologically rich languages, especially from small treebanks, is still a challenging task. In this paper we investigate how state-of-the-art parsing performance can be achieved on Spanish, a language with a rich verbal morphology, with a non-lexicalized parser trained on a treebank co...